Multivariate estimation of substructure amplitudes for a single-wavelength anomalous diffraction experiment

A new equation for the calculation of substructure-factor amplitudes for substructure detection from a single-wavelength anomalous diffraction experiment produces better results compared with the currently used estimates in test cases.


Introduction
In determining a macromolecular crystal structure solely from its anomalous signal, the first step is to determine the position of the anomalous substructure that is present. The application of direct methods combined with Patterson techniques, as implemented, for example, in the programs SHELXD (Schneider & Sheldrick, 2002) and HySS (Grosse- Kunstleve & Adams, 2003), or the application of phase-retrieval techniques as implemented in PRASA (Skubá k, 2018) have proven to be very powerful in detecting anomalous substructures, particularly when the anomalous substructure contains many atoms or the signal is very weak.
In all of these approaches, in order to detect the anomalous substructure an estimate of the substructure-factor amplitude |F a | is required. The absolute value of the Bijvoet difference (ÁF = ||F + | À |F À ||) is typically input to substructure-detection programs as an estimate for |F a |.
To improve the methods further, here we propose new formulas and a new refinement strategy to calculate |F a | values. Previously, Terwilliger (1994) and Burla et al. (2002Burla et al. ( , 2003 employed Bayesian and multivariate approaches to obtain the probability distribution of |F a |. Here, we expand on their work and derive a probability distribution for P(|F a |; |F + |, |F À |) that takes into account measurement errors in |F + | and |F À | and does not assume any relationship between the Friedel phases. We report that at least in our practical implementation, better results were obtained by using the approximation of Burla and coworkers, probably due to numerical stability issues of the more general equation. Furthermore, we propose the maximum-likelihood refinement of errors and scale parameters to obtain the optimal values, given the distributions that we have obtained. Finally, we apply the newly implemented |F a | estimation to over 180 test cases and show the superior performance of these estimates compared with the ÁF values when used by the substructure-determination program PRASA.

Methods
To obtain an estimate of the substructure-factor amplitude |F a | from a SAD experiment, the expected value of |F a | given the observations |F + | and |F À | is required. Let F + denote a structure factor with Miller indices h, k, l; (F À )* denote the complex conjugate of a structure factor with Miller indices Àh, Àk, Àl; F a denote a substructure factor with Miller indices h, k, l; and + and À denote the phases of F + and (F À )*, which we will refer to as Friedel pair phases. Then, assuming a complex multivariate Gaussian distribution for P[F a , F + , (F À )*], the following expression can be obtained: hjF a j; jF þ j; jF À ji exp & À 2jF þ jjF À j a 23 À a 12 a 13 þ b 12 b 13 a 11 cosðÞ À b 23 þ a 13 b 12 À a 12 b 13 a 11 sinðÞ !' where ðjF þ j; jF À j; ; AEÞ ¼ ða 2 12 þ b 2 12 ÞjF þ j 2 ða 2 13 þ b 2 13 ÞjF À j 2 a 11 þ 2jF þ jjF À jða 12 a 13 þ b 12 b 13 Þ cosðÞ a 11 þ 2jF þ jjF À jða 13 b 12 À a 12 b 13 Þ sinðÞ a 11 : The above expression is derived in Appendix A; it does not assume + = À as was required in earlier publications (Burla et al., 2002(Burla et al., , 2003, it incorporates the effect of measurement errors in the observed Friedel pair amplitudes and it can be calculated by a single numerical integration. In the above expression, AE is the (Hermitian) covariance matrix of the complex Gaussian distribution P[F a , F + , (F À )*], with the elements of its inverse denoted z jk = a jk + ib jk , = + À À , È(x, y, z) is the Kummer confluent hypergeometric function and I 0 is the modified Bessel function of the first kind and of zero order. The covariance matrix AE was calculated using the expressions derived previously (Pannu et al., 2003) and the correlation between structure factors. To ensure that the matrix remains positive definite, the inverse of the covariance matrix was calculated from the eigenvalues and eigenvectors calculated from LAPACK routines (Anderson et al., 1999) to remove negative eigenvalues.
We have implemented two equations based on equation (1) in a new program Afro for the multivariate estimation of |F a | values. One equation is equation (1) itself, while the other is a simplified form of equation (1) using the Friedel pair phase equality assumption as suggested by Burla et al. (2002Burla et al. ( , 2003: hjF a j; jF þ j; jF À ji ¼ We have found that the simpler equation (3), i.e. assuming that the Friedel pair phases are equal, led to better performance in the test cases shown below, which is likely to be due to improved numerical stability. Thus, results from the implementation of this equation are shown below. The covariance matrix AE depends on both the number and the (overall) temperature factor of the substructure atoms. As these parameters are usually unknown, a likelihood estimate is obtained by Afro. Thus, after initial estimates of the number and the overall temperature factor of the substructure atoms have been input, the parameters are refined using the marginal distribution P(|F + |, |F À |). The refinement of these parameters turned out to have a large radius of convergence, and better results were obtained when refined values were used compared with when unrefined values. We have previously discussed the procedure (Pannu, 2007) and a similar approach was recently reported by Hatti et al. (2021). After the refinement, the |F a | values are estimated using equation (1). Local scaling (Blessing, 1997) has been also implemented in Afro which scales |F + | to |F À | in local spheres.
The multivariate |F a | calculation using the Friedel pair phase equality assumption as implemented in Afro was tested on a sample of 182 SAD data sets as specified in Appendix B containing a large number of anomalous scatterers (selenium, sulfur, iodine, zinc, gold, copper, platinum, krypton, manganese, iron, cadmium, nickel, calcium and mercury) and a large range of data resolutions from 0.94 to 3.9 Å . For each data set, a complete Crank2 (Skubák & Pannu, 2013) structure-solution run was performed, with Afro being used for the calculation of |F a | and E (normalized |F a |), PRASA being used for substructure determination and REFMAC5 (Nicholls et al., 2018), Parrot (Cowtan, 2010), Buccaneer (Cowtan, 2008) and SHELXE (Usó n & Sheldrick, 2018) being used in the subsequent combined phasing, density modification and model building. Versions of the programs corresponding to CCP4 (Winn et al., 2011) version 8.0.002 were used, except for Crank2, where the more recent version 2.0.325 was used, and a bug fix in REFMAC5 implemented by us to prevent the program from crashing for very large data sets.
The input to Crank2 consisted of the SAD data set, the protein sequence and a specification of the anomalously scattering atom type with anomalous scattering coefficients. For five data sets, a value of the solvent content corresponding to the correct number of monomers in the asymmetric unit was specified, otherwise the default options were used. An incorrect solvent-content estimate would not affect the |F a | research papers estimation as it is not used in it; however, since it is an important phase-improvement parameter, it would lead to 'randomly' incomplete models for data sets that could otherwise be automatically built, thus making the model-building analysis less relevant.
For each data set we calculated the overall correlation of the estimated E values with the 'final' substructure E values in the following way. The final anomalously scattering substructure (either deposited or, if not available, determined from the anomalous difference maps) was input to REFMAC5 using 0 refinement cycles. The calculated amplitudes from REFMAC5 were then input to ECALC from CCP4 (Ian Tickle, unpublished work), providing the final substructure E values. The correlation between the estimated and final E values was calculated using the SFTOOLS utility from CCP4 (Bart Hazes, unpublished work), which divided the data-set reflections into 20 resolution bins and calculated the correlations per resolution bin. Finally, an average of the bin correlations up to 'anomalous resolution' was calculated. The anomalous resolution was determined once for each data set, corresponding to the lowest resolution (the largest number) included in those resolution bins in which the correlation between the multivariate E values and the final E values was smaller than 0.05 and an average of correlations from three consecutive resolution bins was smaller than 0.05.
Estimation of E values from Friedel pair differences (ÁE) was also implemented in Afro and was tested on the 182 SAD data sets to compare its performance against the multivariate estimation. Complete structure solution from ÁE was attempted with Crank2 using the same pipeline and default options as used in the runs from multivariate Afro.
The anomalous substructure obtained by PRASA is considered to be 'correctly determined' if at least one third of the atoms in the final anomalous substructure had a matching atom (within 2 Å distance) in the substructure obtained after transformation by SITCOM (Dall'Antonia & Schneider, 2006). Similarly to as in Skubá k (2018), we have observed that typically if approximately 1/3 of the substructure atoms have been correctly identified in substructure determination, the remaining significant anomalous scatterers can either be added by Crank2 from the anomalous maps or their absence does not affect the success of model building.
The model-building performance is judged by the fraction of the PDB-deposited model backbone that is 'correctly built'. A residue is considered to be correctly built if its C position is at a distance of at most 2 Å from a deposited model C ('C -deposited') position and a neighbouring C position is at a distance of at most 2 Å from a neighbour of the Cdeposited position (sequence identity or directionality is not checked). A custom script evaluating the model-building performance using these criteria was used.
For all data sets where one of the pipelines failed to determine the substructure, a 'thorough' substructuredetermination protocol was tested: the number of PRASA trials was increased to 100 000 trials from the default maximum of 2000 trials, more high-resolution cutoffs were tested (the high-resolution cutoff step was decreased to 0.1 from the default of 0.25) and the initial high-resolution cutoff was set to be identical to the anomalous resolution. The thorough protocol aims to estimate whether it is possible to determine the substructure by PRASA from the input E values at all.

Results and discussion
The correlation of the multivariate E values estimated by Afro with the final substructure E is typically significantly larger than that for ÁE, as demonstrated by Fig. 1. In tests on the 182 SAD data sets, the average correlation improved by 13% (from 0.197 to 0.223) and an improved correlation was observed for 94% of the data sets.
The overall better quality of the E estimates calculated by Afro allowed successful substructure determination by PRASA for six data sets that did not work using ÁE. As summarized in Table 1, the total number of data sets with the substructure correctly determined increased from 162 (89.0%) using ÁE to 168 (92.3%) using multivariate Afro. If these six The correlation of ÁE (x axis) and multivariate E from Afro (y axis) with the 'final' substructure E for each of the 182 tested data sets. The data sets for which the substructure was correctly determined from the multivariate E but not from ÁE are displayed in black (comparing the results of the default substructure-determination protocol) and magenta (the 'thorough' substructure-determination protocol). Table 1 Number of data sets for which the substructure was determined and the majority of the model was built by the two tested pipelines: starting from E calculated as Friedel pair differences and by multivariate Afro.
The first number in each cell denotes the number of successes using the default substructure-determination protocol and the second number that using the 'thorough' substructure-determination protocol with a substantially larger number of trials and a larger number of high-resolution cutoffs.
No. of data sets (default/thorough) A majority of the model was correctly built for 156 data sets (85.7%) starting from substructure determination using ÁE and for 161 data sets (88.5%) starting from substructures determined by multivariate E from Afro.
Using the 'thorough' substructure-determination protocol with a large number of substructure trials and resolution cutoffs for the data sets where substructure determination failed led to the determination of another two substructures starting from the multivariate Afro. Similarly, one more substructure could be determined using the thorough protocol starting from ÁE; this substructure was obtained starting from the multivariate E using the default protocol.
In total (default + thorough protocol), seven substructures were determined from the multivariate E values that were not determined from the ÁE values. Furthermore, determination of one other substructure required the thorough protocol starting from ÁE, while the default protocol was sufficient if multivariate Afro was used. Analysis of the success rates for this data set (PDB entry 2pgc) shows that this was not a coincidence: only four solutions were obtained in 100 000 trials from ÁE (a success rate of 1 in 25 000) and 27 solutions were obtained using the multivariate Afro (1 in 3704).
The data sets used in this paper may not be fully representative of user data. In particular, a large fraction (almost 45%) of the data sets come from the automated JCSG pipeline (Elsliger et al., 2010), which may differ from more recent datacollection methods. Furthermore, a limited number of data sets for which the structure could not be solved are included in the sample used for the paper; such data sets are typically neither deposited nor shared. Thus, the differences in results between the pipelines should not be considered as a quantitative estimate of success-rate improvement for user data but rather as qualitative evidence that the improved |F a | and E estimates by Afro may lead to successful substructure determination and model building for data sets where it failed using ÁE.
The multivariate |F a | estimation by Afro has been integrated into the Crank2 pipeline for automated structure solution from experimental phases and is distributed as part of the CCP4 package, which is available as a binary and as open source.